Optimizing Abstaining Classifiers using ROC Analysis. Tadek Pietraszek / 'tʌ dek pɪe 'trʌ ʃek / ICML 2005 August 9, 2005
1 IBM Zurich Research Laboratory, GSAL Optimizing Abstaining Classifiers using ROC Analysis Tadek Pietraszek / 'tʌ dek pɪe 'trʌ ʃek / pie@zurich.ibm.com ICML 2005 August 9, 2005
2 To classify, or not to classify: that is the question.
3 Motivation
- Abstaining classifiers are classifiers that in certain cases can refrain from classification; they are similar to human experts who can say "I don't know."
- In many domains such experts are preferred to ones that always make a decision and are sometimes wrong (think of a doctor).
- Machine learning has frequently used abstaining classifiers ([FH04], [GL00], [PMAS94], [Tort00]), also implicitly (e.g., active learning, delegating classifiers, triskels (ICML05)).
- Q1: How do we optimally select abstaining classifiers?
- Q2: How do we compare normal and abstaining classifiers?
4 Outline
- Tri-State Classifier
  1. Cost-Based Model
  2. Bounded-Abstention Model
  3. Bounded-Improvement Model
5 Notation
- A binary classifier C is a function C: I -> {+, -}, where i in I is an instance.
- A ranker R (a.k.a. scoring classifier) is a function attaching a rank to an instance, R: I -> R; it can be converted to a binary classifier C_tau using a threshold tau: C_tau(i) = + iff R(i) >= tau.
- An abstaining binary classifier A is a classifier that in certain cases can refrain from classification. We denote this as attaching a third class, "?".
6 ROC Background
- Evaluates model performance under all class and cost distributions: a 2D plot (X: false positive rate, Y: true positive rate).
- A classifier C corresponds to a single point (fp, tp) on the ROC curve.
- A classifier C_tau (or a machine learning method L_tau) has a parameter tau; varying it produces multiple points.
- Therefore we consider a ROC curve a function f: tau -> (fp_tau, tp_tau).
- We can find an inverse function f^-1: (fp_tau, tp_tau) -> tau.
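The thresholding described above can be sketched in a few lines of Python. This is an illustrative helper (`roc_points` is not from the talk): sweeping tau over the observed scores turns a ranker into the family of binary classifiers C_tau and yields one (fp, tp) point per threshold.

```python
def roc_points(scores, labels):
    """Compute (fp, tp) rate pairs for every threshold tau of a ranker.

    Each tau defines a binary classifier C_tau(i) = + iff R(i) >= tau;
    labels: 1 = positive, 0 = negative.
    """
    P = sum(1 for y in labels if y == 1)
    N = len(labels) - P
    points = []
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        points.append((fp / N, tp / P))
    # (0, 0) corresponds to tau = +infinity (classify everything as -)
    return [(0.0, 0.0)] + points

# toy example: scores for 2 positives and 2 negatives
pts = roc_points([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
```

Note the curve always runs from (0, 0) (abstain from predicting +) to (1, 1) (predict + for everything).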
7 ROC Background
- ROC Convex Hull (ROCCH) [PF98]: a piecewise-linear, convex-down curve f_R with the following properties:
  - f_R(0) = 0, f_R(1) = 1.
  - The slope of f_R is monotonically non-increasing.
  - Assume that for any slope value m there exists a point where f_R has slope m: vertices have "slopes" assuming values between the slopes of adjacent edges.
  - Assume sentinel edges: a 0th edge with infinite slope and an (n+1)th edge with slope 0.
- We will use the ROCCH instead of the ROC curve.
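A ROCCH with these properties can be computed with a standard upper-hull sweep. This sketch (the `rocch` helper is mine, not the talk's) keeps exactly the points whose left-to-right slopes are non-increasing:

```python
def rocch(points):
    """Upper convex hull of ROC points (the ROCCH).

    Returns the vertices left to right; slopes between consecutive
    vertices are monotonically non-increasing.
    """
    pts = sorted(set(points))
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop hull[-1] if it lies on or below the segment hull[-2] -> p
            if (x2 - x1) * (p[1] - y1) >= (p[0] - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# (0.3, 0.3) is below the hull and is discarded
hull = rocch([(0, 0), (0.1, 0.4), (0.3, 0.3), (0.5, 0.9), (1, 1)])
```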
8 Some Definitions
- Confusion matrix (A = Actual, C = Classified as):

  A/C   +    -
  +     TP   FN   (P)
  -     FP   TN   (N)

  with rates tp = TP / (TP + FN), fn = FN / (TP + FN), fp = FP / (FP + TN).
- Cost matrix:

  A/C   +     -
  +     0     c12
  -     c21   0

  with cost ratio CR = c21 / c12.
9 Cost-Minimizing Criterion for One Classifier
- Known iso-performance lines [PF98]: the optimal classifier lies where the ROC curve has slope

  f'_ROC(fp) = CR * N / P
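On a ROCCH this criterion reduces to checking the vertices: sliding an iso-performance line of slope CR * N / P down onto the hull touches the vertex with minimum expected cost. A minimal sketch (the helper name and the constant-factor scaling are my choices):

```python
def best_operating_point(rocch_vertices, CR, P, N):
    """Pick the ROCCH vertex minimizing expected cost.

    Equivalent to the iso-performance line of slope CR * N / P [PF98];
    cost is computed up to the constant factor c12 / (P + N).
    """
    def cost(v):
        fp, tp = v
        return fp * N * CR + (1.0 - tp) * P   # FP*c21/c12 + FN
    return min(rocch_vertices, key=cost)

hull = [(0.0, 0.0), (0.1, 0.4), (0.5, 0.9), (1.0, 1.0)]
# cheap false positives -> operate high on the curve;
# expensive false positives (CR = 5) -> retreat toward (0, 0)
```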
10 Outline
- Tri-State Classifier
  1. Cost-Based Model
  2. Bounded-Abstention Model
  3. Bounded-Improvement Model
11 Metaclassifier A_{alpha,beta}
- IDEA: construct the classifier from two binary classifiers C_alpha, C_beta:

  A_{alpha,beta}(x) = +  if C_alpha(x) = + and C_beta(x) = +
                    = -  if C_alpha(x) = - and C_beta(x) = -
                    = ?  otherwise

  where C_alpha, C_beta are such that, for all x, (C_alpha(x) = + implies C_beta(x) = +) and (C_beta(x) = - implies C_alpha(x) = -).

  C_alpha \ C_beta   +            -
  +                  +            impossible
  -                  ?            -

- Can we optimally select C_alpha and C_beta?
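When both member classifiers come from one ranker R with two thresholds tau_alpha >= tau_beta, the implication condition holds by construction and the "impossible" cell can never occur. An illustrative sketch (helper name mine):

```python
def make_abstaining_classifier(R, tau_alpha, tau_beta):
    """Build A_{alpha,beta} from one ranker R and two thresholds.

    tau_alpha >= tau_beta, so C_alpha (the stricter threshold) satisfies
    C_alpha(x) = + implies C_beta(x) = +.
    """
    assert tau_alpha >= tau_beta

    def A(x):
        c_alpha = '+' if R(x) >= tau_alpha else '-'
        c_beta = '+' if R(x) >= tau_beta else '-'
        if c_alpha == '+' and c_beta == '+':
            return '+'
        if c_alpha == '-' and c_beta == '-':
            return '-'
        return '?'  # the two classifiers disagree: abstain
    return A

# identity ranker: scores in [0.3, 0.7) fall in the abstention band
A = make_abstaining_classifier(lambda x: x, tau_alpha=0.7, tau_beta=0.3)
```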
12 Requirements on the ROC Curve
- Requirement: for a ROC curve and any two classifiers C_alpha and C_beta corresponding to points (fp_alpha, tp_alpha) and (fp_beta, tp_beta) such that fp_alpha <= fp_beta:

  for all x: (C_alpha(x) = + implies C_beta(x) = +) and (C_beta(x) = - implies C_alpha(x) = -)

- These conditions are the same as used by [FlachWu03] and are met in particular if classifiers C_alpha and C_beta are constructed from a single ranker R.
13 Optimal Metaclassifier A_{alpha,beta}
- How do we compare binary classifiers and abstaining classifiers? How do we select an optimal classifier?
- No clear answer. Either:
  - use a cost-based model (Cost-Based Model), or
  - use boundary conditions:
    - maximum fraction of instances classified as "?" (Bounded-Abstention Model),
    - maximum misclassification cost (Bounded-Improvement Model).
14 Cost-Based Model
- 2x3 cost matrix (A = Actual, C = Classified as):

  A/C   +     -     ?
  +     0     c12   c13
  -     c21   0     c23

- C_alpha has confusion matrix (TP_alpha, FN_alpha, FP_alpha, TN_alpha); C_beta has (TP_beta, FN_beta, FP_beta, TN_beta).
- Important properties: fp_alpha <= fp_beta and fn_beta <= fn_alpha.
15 Selecting the Optimal Classifier
- Similar criterion: minimize the cost

  rc = (1 / (P + N)) [ FN_beta c12 + FP_alpha c21 + (FN_alpha - FN_beta) c13 + (FP_beta - FP_alpha) c23 ]

  (the first two terms are misclassification costs; the last two are the costs of the instances the two classifiers disagree on, i.e., the abstained instances).
- Setting the partial derivatives with respect to fp_beta and fp_alpha to zero gives

  f'_ROC(fp_beta)  = (c23 / (c12 - c13)) * N / P
  f'_ROC(fp_alpha) = ((c21 - c23) / c13) * N / P
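When the search is restricted to ROCCH vertices, the cost rc above can simply be evaluated over all ordered vertex pairs. This is a brute-force stand-in for the slope conditions, not the paper's algorithm (helper name and example numbers are mine):

```python
def optimal_pair(vertices, P, N, c12, c21, c13, c23):
    """Search ROCCH vertex pairs (alpha, beta) with fp_alpha <= fp_beta
    for the pair minimizing the 2x3 cost-matrix cost rc."""
    best, best_rc = None, float('inf')
    for i, (fpa, tpa) in enumerate(vertices):
        for (fpb, tpb) in vertices[i:]:
            FP_a, FN_a = fpa * N, (1 - tpa) * P
            FP_b, FN_b = fpb * N, (1 - tpb) * P
            rc = (FN_b * c12 + FP_a * c21
                  + (FN_a - FN_b) * c13 + (FP_b - FP_a) * c23) / (P + N)
            if rc < best_rc:
                best, best_rc = ((fpa, tpa), (fpb, tpb)), rc
    return best, best_rc

hull = [(0.0, 0.0), (0.1, 0.4), (0.5, 0.9), (1.0, 1.0)]
# with cheap abstention (c13 = c23 = 0.2) the optimum abstains
pair, rc = optimal_pair(hull, 100, 100, 1.0, 1.0, 0.2, 0.2)
```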
16 Cost-Based Model: a Simulated Example (figures)
- Left: ROC curve with the two optimal classifiers A and B, annotated with the slope conditions f'_ROC(fp_alpha) and f'_ROC(fp_beta).
- Right: misclassification cost for different combinations of A and B, with minima at FP(a) and FP(b).
17 Understanding Cost Matrices
- The 2x2 cost matrix is well known. The 2x3 cost matrix has some interesting properties: e.g., under which conditions the optimal classifier is an abstaining classifier.
- Our derivation is valid for

  (c13 < c12) and (c23 < c21) and (c13 c21 + c23 c12 < c12 c21)   (*)

- We can prove that if this condition is not met, the classifier is a trivial binary classifier.
18 Cost Matrices: Interesting Cases
- How do we set c13, c23 so that the classifier is a nontrivial abstaining classifier?
- Two interesting cases:
  - Symmetric case (c13 = c23):  c13 = c23 < c12 c21 / (c12 + c21)
  - Proportional case (c13 / c23 = c12 / c21):  c13 < c12 / 2 and c23 < c21 / 2
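Both special cases follow from the general validity condition (*) above; a few lines of Python make the bounds concrete (assuming my reconstruction of (*); the helper is illustrative):

```python
def abstention_is_nontrivial(c12, c21, c13, c23):
    """Check condition (*): abstaining beats a trivial binary
    classifier only when abstention is cheap enough."""
    return c13 < c12 and c23 < c21 and c13 * c21 + c23 * c12 < c12 * c21

# symmetric case c13 == c23: nontrivial iff c13 < c12*c21 / (c12 + c21);
# with c12 = c21 = 1 the bound is 0.5
assert abstention_is_nontrivial(1.0, 1.0, 0.4, 0.4)
assert not abstention_is_nontrivial(1.0, 1.0, 0.6, 0.6)
```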
19 Bounded Models
- Problem: a 2x3 cost matrix is not always given and would have to be estimated; however, the classifier is very sensitive to c13, c23.
- Instead, find other optimization criteria for an abstaining classifier using a standard 2x2 cost matrix: calculate the misclassification cost per classified instance.
- Follow the same reasoning to find the optimal classifier.
20 Bounded Models: Equations
- We obtain the following equations, determining the relationship between the abstention fraction k and the cost rc as a function of the classifiers C_alpha, C_beta:

  k  = (1 / (P + N)) [ (fp_beta - fp_alpha) N + (fn_alpha - fn_beta) P ]
  rc = (FP_alpha c21 + FN_beta c12) / ( (P + N) (1 - k) )

- Constrain k, minimize rc: bounded-abstention model.
- Constrain rc, minimize k: bounded-improvement model.
- There is no algebraic solution; we need to optimize numerically.
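The bounded-abstention optimization (constrain k, minimize rc) can be approximated by a coarse grid search over (fp_alpha, fp_beta) on the ROCCH. This is only a sketch of the numerical optimization the talk mentions, not the paper's algorithm; the helper name and the toy concave curve are mine:

```python
def bounded_abstention(f, P, N, c12, c21, k_max, grid=101):
    """Among pairs fp_alpha <= fp_beta whose abstention fraction k
    stays below k_max, minimize the cost rc per classified instance.

    f is the ROCCH as a function fp -> tp (increasing, concave).
    """
    fps = [i / (grid - 1) for i in range(grid)]
    best_rc, best_pair = float('inf'), None
    for i, fpa in enumerate(fps):
        for fpb in fps[i:]:
            tpa, tpb = f(fpa), f(fpb)
            # fraction of instances the pair abstains on
            k = ((fpb - fpa) * N + (tpb - tpa) * P) / (P + N)
            if k > k_max or k >= 1.0:
                continue
            # misclassification cost per classified instance
            rc = (fpa * N * c21 + (1 - tpb) * P * c12) / ((P + N) * (1 - k))
            if rc < best_rc:
                best_rc, best_pair = rc, (fpa, fpb)
    return best_rc, best_pair

# f(fp) = sqrt(fp) is a simple concave stand-in for a ROCCH;
# allowing 20% abstention lowers the per-instance cost
rc0, _ = bounded_abstention(lambda fp: fp ** 0.5, 100, 100, 1.0, 1.0, k_max=0.0)
rc2, pair = bounded_abstention(lambda fp: fp ** 0.5, 100, 100, 1.0, 1.0, k_max=0.2)
```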
21 Bounded-Abstention Model
- Among classifiers abstaining for no more than a fraction k_MAX of the instances, find the one that minimizes rc.
- A useful application is real-time processing, where the non-classified instances will be processed by another classifier with a limited processing speed.
- We can prove that the solution is not limited to vertices of the ROCCH.
22 Bounded-Abstention Model: a Simulated Example (figures)
- Left: ROC curve (tp vs. fp) with the two optimal classifiers A and B.
- Right: misclassification cost in the bounded case (fraction of "?" <= 0.2), with minima at FP(a) and FP(b).
23 Bounded-Improvement Model
- Among classifiers having a misclassification cost not higher than rc_MAX, find the one that abstains for the smallest number of instances.
- Useful, e.g., in the medical domain, where a test must achieve a certain lower misclassification cost while allowing for non-classified instances.
- For the evaluation we use f such that rc_MAX = (1 - f) rc, where rc is the cost of the optimal binary classifier.
- We can prove that the solution is not limited to vertices of the ROCCH.
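The dual search (constrain rc, minimize k) can be sketched the same way as the bounded-abstention grid search; again this is an illustrative stand-in for the numerical optimization, with helper name and toy curve of my choosing:

```python
def bounded_improvement(f, P, N, c12, c21, rc_max, grid=101):
    """Among pairs fp_alpha <= fp_beta whose per-classified-instance
    cost rc stays below rc_max, minimize the abstention fraction k.

    f is the ROCCH as a function fp -> tp (increasing, concave).
    """
    fps = [i / (grid - 1) for i in range(grid)]
    best_k, best_pair = float('inf'), None
    for i, fpa in enumerate(fps):
        for fpb in fps[i:]:
            tpa, tpb = f(fpa), f(fpb)
            k = ((fpb - fpa) * N + (tpb - tpa) * P) / (P + N)
            if k >= 1.0:
                continue
            rc = (fpa * N * c21 + (1 - tpb) * P * c12) / ((P + N) * (1 - k))
            if rc <= rc_max and k < best_k:
                best_k, best_pair = k, (fpa, fpb)
    return best_k, best_pair

# demand a cost below what any binary classifier on f(fp) = sqrt(fp)
# can reach (its optimum is 0.375): some abstention becomes necessary
k, pair = bounded_improvement(lambda fp: fp ** 0.5, 100, 100, 1.0, 1.0, rc_max=0.3)
```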
24 Bounded-Improvement Model: a Simulated Example (figures)
- Left: ROC curve with the two optimal classifiers A and B.
- Right: fraction of skipped instances for different combinations of A and B, with minima at FP(a) and FP(b).
25 Experiments
- Tested with 15 UCI KDD datasets, using averaged cross-validation.
- Each model uses one independent parameter: c13 = c23, k, or f.
- Classifier: Bayesian classifier from Weka [WF00].
- Numerical calculations and optimization in R.
- Showing results for one representative dataset.
26 Building an Abstaining Classifier
- Inputs: training instances plus either (1) a 2x3 cost matrix or (2) a 2x2 cost matrix with a fraction k or f.
- For each fold of an n-fold cross-validation:
  1. Build a classifier on the training set and classify the testing set.
  2. Collect statistics and build the ROC curve.
  3. Find the thresholds (per model) on the ROC and construct the tri-state classifier.
- Repeat m times and average; a binary classifier is built the same way for comparison.
27 Results: Cost-Based Model (figures, ionosphere.arff)
- Cost improvement and fraction of instances skipped, plotted against the cost value c13 = c23.
28 Results: Bounded-Abstention Model (figures, ionosphere.arff)
- Relative cost improvement and misclassification cost (rc), plotted against the fraction skipped (k).
29 Results: Bounded-Improvement Model (figures, ionosphere.arff)
- Fraction skipped (k), plotted against the relative cost improvement (f) and the misclassification cost (rc).
30 Summary
- Abstaining classifier as a metaclassifier:
  - Cost-based model
  - Bounded-improvement model
  - Bounded-abstention model
- Methodically tested and shown to work (in all three models): multiple data sets (UCI KDD), cross-validation.
- The idea fits our alert classification system (see: Pietraszek 2004, "Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection").
31 IBM Zurich Research Laboratory, GSAL END
32 Bibliography (1)
- [Chow70] Chow, C. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16.
- [Dietterich98] Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10.
- [Fawcett03] Fawcett, T. (2003). ROC graphs: Notes and practical considerations for researchers. Technical Report, HP Laboratories.
- [FFH04] Ferri, C., Flach, P., Hernandez-Orallo, J. (2004). Delegating classifiers. Proceedings of the 21st International Conference on Machine Learning (ICML'04). Alberta, Canada: Omnipress.
- [FerriHernandez04] Ferri, C., Hernandez-Orallo, J. (2004). Cautious classifiers. Proceedings of ROC Analysis in Artificial Intelligence, 1st International Workshop (ROCAI-2004). Valencia, Spain.
- [FlachWu03] Flach, P. A., Wu, S. (2003). Repairing concavities in ROC curves. Proc. UK Workshop on Computational Intelligence. Bristol, UK.
- [GambergerLavrac00] Gamberger, D., Lavrac, N. (2000). Reducing misclassification costs. Principles of Data Mining and Knowledge Discovery, 4th European Conference (PKDD 2000). Lyon, France: Springer-Verlag.
- [HettichBay99] Hettich, S., Bay, S. D. (1999). The UCI KDD Archive. Web page at
- [LewisCatlett94] Lewis, D. D., Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. Proceedings of ICML-94, 11th International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
33 Bibliography (2)
- [NelderMead65] Nelder, J., Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7.
- [PMAS94] Pazzani, M. J., Murphy, P., Ali, K., Schulenburg, D. (1994). Trading off coverage for accuracy in forecasts: Applications to clinical data analysis. Proceedings of the AAAI Symposium on AI in Medicine. Stanford, CA.
- [ProvostFawcett98] Provost, F., Fawcett, T. (1998). Robust classification systems for imprecise environments. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98). AAAI Press.
- [Tortorella00] Tortorella, F. (2000). An optimal reject rule for binary classifiers. Advances in Pattern Recognition, Joint IAPR International Workshops SSPR 2000 and SPR 2000. Alicante, Spain: Springer-Verlag.
- [WittenFrank00] Witten, I. H., Frank, E. (2000). Data Mining: Practical machine learning tools with Java implementations. San Francisco: Morgan Kaufmann.
34 Further Improvements in the Bounded-Abstention and Bounded-Improvement Models
- In previous work, we used general numerical methods to find the solution.
- But the ROCCH is not an arbitrary function; it has special properties. Thus we can do much better and understand the tri-state classifiers better.
- We propose an algorithm and a proof (see paper).
35 Optimal Classifier Path (figure)
- Optimal classifier path for the bounded-abstention model: cost as a function of FP(a) and FP(b).
36 Cost Algorithm: Bounded-Abstention Model (figure)
- Smallest relative gradient path for bounded abstention, over FP(a) and FP(b).
37 k Algorithm: Bounded-Improvement Model (figure)
- Optimal classifier path for bounded improvement, over FP(a) and FP(b).
38 Selecting the Optimal Classifier
- Criterion: minimize the misclassification cost

  rc = (1 / (P + N)) (FN c12 + FP c21)
     = (1 / (P + N)) (P (1 - tp) c12 + FP c21),   with tp = TP / P = f_ROC(fp), fp = FP / N

  d rc / d FP = (1 / (P + N)) (c21 - (P / N) c12 f'_ROC(fp)) = 0

  which gives f'_ROC(fp) = (c21 / c12) * N / P.
39 Cost Matrices
- Theorem: if (*) is not met, the classifier is a trivial binary classifier.

  (c13 < c12) and (c23 < c21) and (c13 c21 + c23 c12 < c12 c21)   (*)

- Proof (sketch):
  - Show that for an optimal classifier f_R(fp*_alpha) <= f_R(fp*) <= f_R(fp*_beta), where fp* corresponds to an optimal binary classifier.
  - Show that if (*) is not met, the partial derivative of rc with respect to fp_alpha is positive for fp*_alpha < fp*, and the partial derivative with respect to fp_beta is positive for fp*_beta > fp*; therefore fp*_alpha = fp* = fp*_beta.
More informationBackground literature. Data Mining. Data mining: what is it?
Background literature Data Mining Lecturer: Peter Lucas Assessment: Written exam at the end of part II Practical assessment Compulsory study material: Transparencies Handouts (mostly on the Web) Course
More informationVariations of Logistic Regression with Stochastic Gradient Descent
Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic
More informationModeling High-Dimensional Discrete Data with Multi-Layer Neural Networks
Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Yoshua Bengio Dept. IRO Université de Montréal Montreal, Qc, Canada, H3C 3J7 bengioy@iro.umontreal.ca Samy Bengio IDIAP CP 592,
More informationUnsupervised Classification via Convex Absolute Value Inequalities
Unsupervised Classification via Convex Absolute Value Inequalities Olvi L. Mangasarian Abstract We consider the problem of classifying completely unlabeled data by using convex inequalities that contain
More informationSupport Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM
1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University
More informationLecture 3. STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher
Lecture 3 STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher Previous lectures What is machine learning? Objectives of machine learning Supervised and
More informationPart I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis
Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More informationCost-based classifier evaluation for imbalanced problems
Cost-based classifier evaluation for imbalanced problems Thomas Landgrebe, Pavel Paclík, David M.J. Tax, Serguei Verzakov, and Robert P.W. Duin Elect. Eng., Maths and Comp. Sc., Delft University of Technology,
More informationI D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69
R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual
More informationSelection of Classifiers based on Multiple Classifier Behaviour
Selection of Classifiers based on Multiple Classifier Behaviour Giorgio Giacinto, Fabio Roli, and Giorgio Fumera Dept. of Electrical and Electronic Eng. - University of Cagliari Piazza d Armi, 09123 Cagliari,
More informationQ1 (12 points): Chap 4 Exercise 3 (a) to (f) (2 points each)
Q1 (1 points): Chap 4 Exercise 3 (a) to (f) ( points each) Given a table Table 1 Dataset for Exercise 3 Instance a 1 a a 3 Target Class 1 T T 1.0 + T T 6.0 + 3 T F 5.0-4 F F 4.0 + 5 F T 7.0-6 F T 3.0-7
More informationPattern-Based Decision Tree Construction
Pattern-Based Decision Tree Construction Dominique Gay, Nazha Selmaoui ERIM - University of New Caledonia BP R4 F-98851 Nouméa cedex, France {dominique.gay, nazha.selmaoui}@univ-nc.nc Jean-François Boulicaut
More informationPredicting Partial Orders: Ranking with Abstention
Predicting Partial Orders: Ranking with Abstention Weiwei Cheng 1,Michaël Rademaker 2, Bernard De Baets 2,andEykeHüllermeier 1 1 Department of Mathematics and Computer Science University of Marburg, Germany
More informationMulticlass Multilabel Classification with More Classes than Examples
Multiclass Multilabel Classification with More Classes than Examples Ohad Shamir Weizmann Institute of Science Joint work with Ofer Dekel, MSR NIPS 2015 Extreme Classification Workshop Extreme Multiclass
More informationPerformance Evaluation
Performance Evaluation Confusion Matrix: Detected Positive Negative Actual Positive A: True Positive B: False Negative Negative C: False Positive D: True Negative Recall or Sensitivity or True Positive
More informationEnsembles of classifiers based on approximate reducts
Fundamenta Informaticae 34 (2014) 1 10 1 IOS Press Ensembles of classifiers based on approximate reducts Jakub Wróblewski Polish-Japanese Institute of Information Technology and Institute of Mathematics,
More informationComparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees
Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland
More informationVC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms
03/Feb/2010 VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms Presented by Andriy Temko Department of Electrical and Electronic Engineering Page 2 of
More informationProbabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning
Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning Guy Van den Broeck KULeuven Symposium Dec 12, 2018 Outline Learning Adding knowledge to deep learning Logistic circuits
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationDevelopment of a Data Mining Methodology using Robust Design
Development of a Data Mining Methodology using Robust Design Sangmun Shin, Myeonggil Choi, Youngsun Choi, Guo Yi Department of System Management Engineering, Inje University Gimhae, Kyung-Nam 61-749 South
More informationRegularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline
Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function
More informationA Posteriori Corrections to Classification Methods.
A Posteriori Corrections to Classification Methods. Włodzisław Duch and Łukasz Itert Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland; http://www.phys.uni.torun.pl/kmk
More informationData Mining and Knowledge Discovery: Practice Notes
Data Mining and Knowledge Discovery: Practice Notes dr. Petra Kralj Novak Petra.Kralj.Novak@ijs.si 7.11.2017 1 Course Prof. Bojan Cestnik Data preparation Prof. Nada Lavrač: Data mining overview Advanced
More information